Metric Suffix Array For Large-Scale Similarity Search

Authors

  • Hisham Mohamed
  • Stéphane Marchand-Maillet
Abstract

We propose the Metric Suffix Array (MSA), a novel and efficient data structure for permutation-based indexing. The MSA follows the same principles as the suffix array, which is mainly used for text indexing. Here, we build the MSA as an alternative for large-scale content-based information retrieval. We also show how the MSA scales to parallel and distributed architectures, and we study the performance and efficiency of our algorithms in a large-scale context. Experimental results show fast response times with high efficiency and effectiveness.
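
The abstract does not spell out the construction, so the following is only a minimal sketch of the general idea it names, not the authors' algorithm. It assumes Euclidean distance, a hypothetical set of reference points `refs`, and a naive suffix sort. Each object is represented by the IDs of its nearest reference points (its permutation), the permutations are concatenated into one sequence, and a suffix array over that sequence is searched for objects whose permutation starts with the same references as the query. All function and parameter names here are illustrative.

```python
import bisect
import math


def permutation(obj, refs, length):
    """Return the indices of the `length` reference points nearest to obj."""
    order = sorted(range(len(refs)), key=lambda j: (math.dist(obj, refs[j]), j))
    return order[:length]


def build_index(objects, refs, length):
    """Concatenate all permutations and sort the suffix start positions."""
    seq = []
    for obj in objects:
        seq.extend(permutation(obj, refs, length))
    # Naive suffix array: acceptable for a sketch, too slow for large data.
    suffix_array = sorted(range(len(seq)), key=lambda p: seq[p:])
    return seq, suffix_array


def candidates(query, refs, length, seq, suffix_array, prefix_len):
    """Return IDs of objects whose permutation starts with the query's prefix."""
    prefix = permutation(query, refs, length)[:prefix_len]
    # Truncated suffixes stay in sorted order, so binary search applies.
    # (Materialising `keys` per query is for clarity only.)
    keys = [seq[p:p + prefix_len] for p in suffix_array]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix)
    # Keep only matches aligned with the start of an object's permutation.
    return sorted({p // length for p in suffix_array[lo:hi] if p % length == 0})


refs = [(0, 0), (10, 0), (0, 10)]            # hypothetical reference points
objects = [(2, 1), (9, 1), (1, 9), (2, 0)]   # hypothetical data set
seq, sa = build_index(objects, refs, length=2)
print(candidates((1, 0), refs, 2, seq, sa, prefix_len=2))
```

The candidate set would normally be re-ranked with the true distance. How the MSA itself selects, weights, and distributes candidates is described in the full paper rather than in this abstract.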


Similar resources

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20-90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments d...


Compressed Spaced Suffix Arrays

Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still su...


Acceleration of spoken term detection using a suffix array by assigning optimal threshold values to sub-keywords

We previously proposed a fast spoken term detection method that uses a suffix array data structure for searching large-scale speech documents. The method reduces search time via techniques such as keyword division and iterative lengthening search. In this paper, we propose a statistical method of assigning different threshold values to sub-keywords to further accelerate search. Specifically, th...


Sequence Covering for Efficient Host-Based Intrusion Detection

This paper introduces a new similarity measure, the covering similarity, that we formally define for evaluating the similarity between a symbolic sequence and a set of symbolic sequences. A pair-wise similarity can also be directly derived from the covering similarity to compare two symbolic sequences. An efficient implementation to compute the covering similarity is proposed that uses a suffix...


CSA++: Fast Pattern Search for Large Alphabets

Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance, in terms of both space required and search speed. For large alphabets – for example, when the tokens are words – the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field ...



Publication date: 2013